scRNA-seq data poses a philosophical challenge to science. To even begin to understand it, we have to use dimensionality reduction methods, which inherently makes them very difficult to compare - “ground truth” in this case is the incomprehensible pile of data that is the count matrix.
I looked into the data from Jansky et al.. They scRNA-sequenced the developing adrenal gland from human embryos, and I used only the adrenal medulla subset of that data, as that is the site of interesting biology for neuroblastoma. I used primarily UMAP and multiscale PHATE for dimensionality reduciton. I compared the effects of feature selection (all genes - noisy plots, variable genes - tighter plots) and scaling (uncertain, but no advantage of scaled over unscaled) on the resulting plots, and zoomed into the bridge-connecting chromaffin-neuroblast trajectory, finding a potential new subpopulation between connecting chromaffins and early neuroblasts. Lastly, I ran multiscale PHATE on combined data from Jansky et al. and Kameneva et al., finding an unexpected fluke of unknown origin that would have otherwise gone unnoticed.
The basic premise of neuroblastoma is that neuroblasts fail to progress to their final differentiation state (stopping at different points on their trajectory for different patients). They arise from bridge cells (themselves from migrating neural crest cells). The trajectory is as follows: bridge into Schwann Cell Precursors (SCPs) and connecting chromaffins, connecting chromaffins into mature chromaffins and neuroblasts.
I compared a newly written dimensionality reduction method, multiscale PHATE, to UMAP, PCA, and tSNE. The latter three are built into Seurat, but the former has to be run in a separate Python script (it can also be run inside an R notebook such as this, but I found that to be slower). Briefly, msPHATE first turns the data into a transition probability matrix, performs a random walk using that matrix, finds the informational distance between resulting points, and plots that as the dimensionality reduced embedding (this is the PHATE component). Then, the data is progressively smoothed, magnifying any pre-existing density heterogeneity and generating natural groupings at multiple scales, resulting in a 3D “tree” that starts with very fine points at its base (individual cells) and ends with coarser natural groupings at the top. The multiple “salient levels of resolution” that one can slice the tree at are also used to break up the embedding into different numbers of clusters, which is what we are interested in for finding cell types within our data.
Another great thing about msPHATE is that the clusters it generates (at any coarseness) can be zoomed into (called “re-embedding” in the literature). I only used the finest coarseness (i.e., each point on the embedding corresponds to a single cell) because all metadata in the Seurat object needs to be for individual cells, but the zooming in capability was very useful. The end point of this project was zooming into the bridge - connecting chromaffin - neuroblast axis in the hope that sequences of a future neuroblastoma cell line experiment at CCB will map at different points on the same trajectory.
Please refer to the msPHATE GitHub page and article for more details, validation, and pretty pictures. I like this method because it can be used on just about any data and seems to have fewer compromises than UMAP.
This is the code I used to run msPHATE.
### USAGE ###
# Export counts.csv. gene_names.txt and cell.names.txt from R
# Now you can load in data by running this script (careful, this restarts the shell!)
# Find msPHATE clusters at different levels with clus
### USAGE ###
import multiscale_phate as mp
import numpy as np
import pandas as pd
import scprep
import os
wdir = '~/Jansky/ms_phate/msphate_jansky/msPHATE-Seurat'
npca = 20 # Default number for analyses in R
gran = 0.1 # Default msPHATE parameter
wdir = os.path.expanduser(wdir)
gene_path = os.path.join(wdir, 'gene_names.txt')
cell_path = os.path.join(wdir, 'cell_names.txt')
print('Loading scaled data...')
data_path = os.path.join(wdir, 'counts.csv')
data = scprep.io.load_csv(data_path, cell_axis='column',
gene_names=gene_path,
cell_names=cell_path)
mp_op = mp.Multiscale_PHATE(n_pca=npca, granularity=gran, random_state=0)
levels = mp_op.fit(data)
def clus(clustering_level, vis_level = 0):
clus_level = clustering_level
print('Generating embedding and assigning clusters...')
embedding, clusters, sizes = mp_op.transform(visualization_level = levels[vis_level],
cluster_level = levels[-1*clus_level])
print('Exporting embedding and clusters...')
msPHATE_embedding = pd.DataFrame(embedding,
index = pd.read_csv(cell_path, header=None).iloc[:, 0].to_list())
msPHATE_embedding.to_csv('msPHATE_embedding.csv', header=False)
msPHATE_clusters = pd.DataFrame(clusters)
msPHATE_clusters.to_csv('msPHATE_clusters.csv', header=False, index=False)
print('\n')
print(data)
print('\n')
print(levels)
print('\n')
print(embedding)
print('\n')
print(set(clusters))
print ('\n')
print(' Clustering level ' + str(clus_level))
print('\n')
print('Done')
This code is to load in previously made objects without rerunning intensive analyses.
## RESTART ##
wdir <- '~/msPHATE-Seurat/'
data_name <- 'seurats/med_final.RDS'
med <- readRDS(paste(wdir, data_name, sep = ''))
data_name <- 'seurats/bcn_final.RDS'
bcn <- readRDS(paste(wdir, data_name, sep = ''))
data_name <- 'seurats/bcn_ms_final.RDS'
bcn_ms <- readRDS(paste(wdir, data_name, sep = ''))
data_name <- 'seurats/SeuratProject.h5Seurat'
comb <- LoadH5Seurat(file = paste(wdir, data_name, sep = ''))
Jansky et al. used the top 2000 most variable genes for dimensionality reduction, standard procedure in scRNA-seq analyses. What would happen if we used all genes instead?
The top 2000 genes are already stored as scale.data in the Seurat file provided by Jansky et al. This is the code to export the data from R and import msPHATE outputs into the Seurat object.
write.table(GetAssayData(med, 'scale.data'), file = paste(wdir, 'counts.csv', sep = ''), sep = ',', row.names = FALSE, col.names = FALSE)
write.table(colnames(GetAssayData(med, 'scale.data')), file = paste(wdir, 'cell_names.txt', sep = ''), sep = ',', row.names = FALSE, col.names = FALSE)
write.table(rownames(GetAssayData(med, 'scale.data')), file = paste(wdir, 'gene_names.txt', sep = ''), sep = ',', row.names = FALSE, col.names = FALSE)
# Files written are run through msPHATE in Python
At this point, I run msphate.py and clus(3)
msPHATE_embedding_global <- read.csv(paste(wdir, 'msPHATE_embedding.csv', sep = ''), header=FALSE, row.names = 1, col.names = c('', 'msPHATE_1', 'msPHATE_2'))
msPHATE_embedding_global <- as.matrix(msPHATE_embedding_global)
med[['msphate_2k']] <- CreateDimReducObject(embeddings = msPHATE_embedding_global, key = 'msPHATE_')
# Only run once - embedding constant for different clustering levels
msPHATE_clusters_global <- as.matrix(read.csv(paste(wdir, 'msPHATE_clusters.csv', sep = ''), header=FALSE))
msPHATE_clusters_global <- factor(msPHATE_clusters_global)
med$msphate_clusters_lvl3_2k <- msPHATE_clusters_global
# Repeat clus(n) and this last step to store clusters at different levels in the Seurat object
This is the code to export scaled data for all genes. The additional table is generated in a copy of the Seurat object (later deleted to save space). This is then stored in the misc slot of the original Seurat object.
med_all <- med
med_all <- NormalizeData(med_all)
med_all <- ScaleData(object = med_all, features = rownames(med_all@assays$RNA@counts))
Misc(med, 'scale.data_all') <- GetAssayData(med_all, 'scale.data')
rm(med_all)
write.table(med@misc$scale.data_all, file = paste(wdir, 'counts.csv', sep = ''), sep = ',', row.names = FALSE, col.names = FALSE)
write.table(colnames(med@misc$scale.data_all), file = paste(wdir, 'cell_names.txt', sep = ''), sep = ',', row.names = FALSE, col.names = FALSE)
write.table(rownames(med@misc$scale.data_all), file = paste(wdir, 'gene_names.txt', sep = ''), sep = ',', row.names = FALSE, col.names = FALSE)
The msPHATE analysis is done on data for all genes as before, and this is the code for UMAP and tSNE on the same data, again by using a temporary copy of the Seurat object.
med[['pca_all']] <- CreateDimReducObject(embeddings = Embeddings(med[['pca']]), key = 'PC_')
med@reductions$pca <- NULL
umap_tsne_all <- med
med$jansky_idents <- Idents(med)
all_genes <- rownames(GetAssayData(umap_tsne_all, 'scale.data'))
umap_tsne_all <- ScaleData(umap_tsne_all, features = all_genes)
umap_tsne_all <- RunPCA(object = umap_tsne_all, features = all_genes, reduction.name = 'pca_all')
umap_tsne_all <- FindNeighbors(object = umap_tsne_all, reduction = 'pca_all')
umap_tsne_all <- FindClusters(object = umap_tsne_all, resolution = 1.2)
umap_tsne_all <- RunUMAP(object = umap_tsne_all, features = all_genes)
umap_tsne_all <- RunTSNE(object = umap_tsne_all, features = all_genes)
med <- RunTSNE(object = med, features = VariableFeatures(med), reduction = 'pca_all')
rm(umap_tsne_all)
med[['umap_2k']] <- CreateDimReducObject(embeddings = Embeddings(med[['umap']]), key = 'UMAP_')
med[['tsne_2k']] <- CreateDimReducObject(embeddings = Embeddings(med[['tsne']]), key = 'UMAP_')
med@reductions$umap <- NULL
med@reductions$tsne <- NULL
med[['umap_all']] <- CreateDimReducObject(embeddings = Embeddings(umap_tsne_all[['umap']]), key = 'UMAP_')
med[['tsne_all']] <- CreateDimReducObject(embeddings = Embeddings(umap_tsne_all[['tsne']]), key = 'UMAP_')
Let’s see the results! To set our bearings straight, here are the UMAP and msPHATE embeddings of the same data coloured by final identities assigned by Jansky et al. (stored as jansky_idents in the metadata).
DimPlot(med, reduction = 'msphate_2k', group.by = 'jansky_idents', pt.size = 1) + ggtitle('msphate_2k by jansky_idents')
DimPlot(med, reduction = 'umap_2k', group.by = 'jansky_idents', pt.size = 1) + ggtitle('umap_2k by jansky_idents')
Immediately, we can see that UMAP is still better for communicating the basic differentiation trajectory: bridge cells go into SCPs or connecting chromaffins, the latter differentiating into neuroblasts or mature chromaffins.
The topology is the same in the msPHATE embedding, but the trajectory is presented as much less linear, with a huge tangle of cells at the bridge-neuroblast-chromaffin junction.
Let’s see how msPHATE broke up this data into clusters.
What I notice about this is that some clusters are very stable with increasing clustering level. This could be measured quantitatively by calculating the msPHATE cluster composition of jansky_idents and its stability at different levels. However, just by eye (and for lack of time…) I can see that late SCPs, bridge + connecting chromaffins, and neuroblasts proper are the most stable as msPHATE clusters.
Now that we have our bearings, let’s actually look at the effect of
feature selection.
The UMAP plot preserves overall topology very well, even though now the entire embedding is much noisier and more diffuse. To be precise, I would have had to redo the annotation by comparing cell type marker expression for this embedding, but the originally assigned identities are close enough and provide a visual aid to trace how the position of cells changes in different embeddings.
The msPHATE is completely different. Perhaps the algorithm assigns
more weight to the non-variable genes than UMAP, and this shows in the
embedding? Let’s see if msPHATE clusters in this case still correspond
well to jansky_idents in this case.
The first thing we see is that there is not the dramatic jump in number of clusters with increasing clustering level that we saw for top 2000 variable genes. I interpret this as the top 2000 gene data containing more information, and hence msPHATE uncovering a lot more clusters than for the more homogeneous all gene data. The abstract signal-to-noise ratio is lower in this case. However, is it right to call it noise? Perhaps we are seeing the high-level structure of the data which may or may not correspond to the clusters based on most variable genes.
What is this “high-level structure”? I can see that neuroblasts and SCPs proper are stably lumped with bridge and proper + connecting chromaffins. The other clusters have no clear pattern. Again, these clusters may or may not be interesting.
Let’s quickly check how these clusters compare to msphate_2k clusters.
DimPlot(med, reduction = 'msphate_all', group.by = 'msphate_clusters_lvl5_all', pt.size = 1) + ggtitle('msphate_all at lvl5')
DimPlot(med, reduction = 'msphate_all', group.by = 'msphate_clusters_lvl3_2k', pt.size = 1) + ggtitle('msphate_all by msphate_clusters_lvl3_2k') # Note that group.by can be done for any metadata variable, making this the best way of storing cluster identities arising from different methods
Since we found earlier that msphate_2k clusters corresponded well to jansky_idents, they don’t correspond well to msphate_all clusters, as expected.
Quick note: there is a lot about UMAP clusters that I didn’t test. Firstly, the clusters technically arise from the Louvain algorithm and should be called that instead. Secondly, the resolution parameter can be adjusted to give the same number of clusters as msPHATE to make them comparable, something I couldn’t do in the time I had. Here is a taster of what I might have done.
DimPlot(med, reduction = 'umap_2k', group.by = 'msphate_clusters_lvl4_2k', pt.size = 1) + ggtitle('umap_2k by msphate_clusters_lvl4_2k')
DimPlot(med, reduction = 'umap_2k', group.by = 'seurat_clusters', pt.size = 1) + ggtitle('umap_2k by seurat_clusters') # Louvain algorithm clusters are stored as metadata variable seurat_clusters and used for both UMAP and tSNE
DimPlot(med, reduction = 'umap_2k', group.by = 'jansky_idents', pt.size = 1) + ggtitle('umap_2k by jansky_idents')
Two big differences here. For example, msPHATE incorporates into connecting chromaffins half of what Louvain combines with the bridge population. It also combines into one a lot of the neuroblast clusters that the Louvain algorithm separated. Would Jansky et al. have separated neuroblasts or chromaffins the way they did had they used msPHATE..?
Finally, a quick look at tSNE.
The embedding (shape) doesn’t change much. I haven’t used tSNE very much in this project, but it can’t be neglected since it can be useful for communicating certain things, such as composition without trajectory. In this case, connecting chromaffins are clearly placed in the centre, emphasising them as the junction between bridge, chromaffins, and neuroblasts.
To continue comparing msPHATE and UMAP, we need to use the most variable genes. However, before subsetting the trajectory we are interested in (as a reminder, it’s bridge - connecting chromaffins - neuroblasts), there is one more decision to make. As I understand it, scaling means representing the data as number of standard deviations from the mean (z-score) after the data is mean-centred (i.e., the mean is set to zero).
What this means is that for genes with a small variance (i.e., most cells have a similar value for this gene), scaling widens the distribution, increasing the numeric value of each expression measurement for that gene. For high variance genes, the opposite happens - moving a set number of measurements away from the distribution centre would lead to a smaller change in numerical value when scaled. From this, I expect scaling to increase the effect of low-DE genes on the embedding, and decrease that of high-DE genes.
As before, the unscaled data is stored in the misc slot of our Seurat
object. It is then exported, run through msPHATE at different levels,
re-imported, and UMAP rerun on the unscaled data.
As expected, scaled embeddings are tighter and unscaled embeddings are sparser. Also, scaling seems to separate cells based on lesser differences. This might explain why SCPs and bridge cells are only connected in the scaled UMAP: the difference must be very small, and hence amplified by scaling.
The clusters correspond remarkably well to jansky_idents, even though the latter were derived from scaled data. Connecting chromaffins and bridge cells did not separate at this clustering level, and so it wasn’t used for zooming into the trajectory of interest.
A character defined by a clear gene module is cell cycling state,
stored as a metadata variable.
It is interesting to see that there is no clear effect of scaling, contrary to my predictions earlier. The only difference is that UMAP clusters them much closer together than msPHATE in both cases.
This is how the subsets were defined. The subsequent analysis is the same as above.
bcn <- subset(med, subset = jansky_idents %in% c('Bridge', 'connecting Chromaffin cells', 'Neuroblasts', 'cycling Neuroblasts', 'late Neuroblasts'))
bcn_ms <- subset(med, subset = msphate_clusters_lvl4_2k %in% c('0', '336', '82', '1', '4607', '6', '4829', '3', '5', '2', '1814', '1838'))
rm(med)
bcn_ms <- subset(bcn_ms, subset = jansky_idents %in% c('Bridge', 'connecting Chromaffin cells', 'Neuroblasts', 'cycling Neuroblasts', 'late Neuroblasts'))
As a reminder, the msPHATE clusters to zoom into were chosen to
visually correspond to the jansky_idents of interest. Only scaled
clusters were chosen because they better corresponded to jansky_idents,
and also in the interests of time…
Let’s load in the previously generated Seurat objects subsets containing bridge, connecting chromaffin, and neuroblast cells.
Will there be a significant difference between filtering by msPHATE clusters first and only filtering by jansky_idents?
DimPlot(bcn, reduction = 'msphate_scaled', group.by = 'jansky_idents', pt.size = 1) + ggtitle('bcn msphate_scaled by jansky_idents')
DimPlot(bcn_ms, reduction = 'msphate_scaled', group.by = 'jansky_idents', pt.size = 1) + ggtitle('bcn_ms msphate_scaled by jansky_idents')
DimPlot(bcn, reduction = 'msphate_unscaled', group.by = 'jansky_idents', pt.size = 1) + ggtitle('bcn msphate_unscaled by jansky_idents')
DimPlot(bcn_ms, reduction = 'msphate_unscaled', group.by = 'jansky_idents', pt.size = 1) + ggtitle('bcn_ms msphate_unscaled by jansky_idents')
The msPHATE-first, unscaled zoom best communicates the linear trajectory we are interested in. Note that both cases, and unlike with the complete dataset, scaling “pulls out” cycling cells, while using unscaled data leaves them entirely within their original population.
The same subset can be viewed as a UMAP plot.
DimPlot(bcn_ms, reduction = 'umap_unscaled', group.by = 'jansky_idents', pt.size = 1) + ggtitle('bcn_ms umap_unscaled by jansky_idents')
The progression from bridge to late neuroblasts is very clear (even clearer than msPHATE below). UMAP is again proving to be best for visualising trajectories.
Let’s look into how these identities correspond to msPHATE clusters
Both plots show a new population of cells between clear chromaffins and clear neuroblasts (cluster 3 for scaled and unscaled).
To investigate this, we introduce a heatmap plot, a way of showing genes characterising each cluster.
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 1423
## Calculating cluster 2128
## Calculating cluster 3382
A lot of these genes are cell cycle genes, but we can still see that msPHATE clusters 1 in both cases correspond to connecting chromaffins, both marked by EGFLAM and RP11-123M6.2. CUX2, ALK, BNC2, are the neuroblast markers, both by jansky_idents and msPHATE clusters (cluster 2 in both cases).
Now we can consider clusters 3. From the heatmaps, it seems that only PLXNA2 is common to both. Note also that it is separated from chromaffins much more clearly in the scaled data, and is quite similar to cluster 5 (mostly aligning with cycling neuroblasts) Let’s look into this more precisely.
bcn_ms_unscaled_lvl4_markers %>% rownames_to_column() %>% filter(cluster == '3') %>% column_to_rownames() %>% top_n(n = 50, wt = avg_log2FC) -> top50_bcn_ms_unscaled_lvl4_clus3_markers
bcn_ms_scaled_lvl4_markers %>% rownames_to_column() %>% filter(cluster == '3') %>% column_to_rownames() %>% top_n(n = 50, wt = avg_log2FC) -> top50_bcn_ms_scaled_lvl4_clus3_markers
clus3_markers <- intersect(top50_bcn_ms_unscaled_lvl4_clus3_markers$gene, top50_bcn_ms_scaled_lvl4_clus3_markers$gene)
clus3_markers
## [1] "PLXNA2" "ZMAT4" "HTR1E" "CENPK"
## [5] "DIAPH3" "RFC3" "LRP4" "HIST1H4C"
## [9] "GFRA1" "ALK" "C21orf58" "DTL"
## [13] "ATAD2" "TOP2A" "ARHGAP11B" "PDZRN3"
## [17] "NELL1" "FANCA" "HELLS" "ADAMTS2"
## [21] "EYA1" "BRIP1" "MMS22L" "MIS18BP1"
## [25] "LINC00669" "RP11-267N12.1" "VIM"
TRUE %in% (clus3_markers %in% (bcn_ms_unscaled_lvl4_markers %>% rownames_to_column() %>% filter(cluster == '1') %>% column_to_rownames() %>% top_n(n = 50, wt = avg_log2FC)))
## [1] FALSE
TRUE %in% (clus3_markers %in% (bcn_ms_unscaled_lvl4_markers %>% rownames_to_column() %>% filter(cluster == '3') %>% column_to_rownames() %>% top_n(n = 50, wt = avg_log2FC)))
## [1] FALSE
TRUE %in% (clus3_markers %in% (bcn_ms_scaled_lvl4_markers %>% rownames_to_column() %>% filter(cluster == '1') %>% column_to_rownames() %>% top_n(n = 50, wt = avg_log2FC)))
## [1] FALSE
TRUE %in% (clus3_markers %in% (bcn_ms_scaled_lvl4_markers %>% rownames_to_column() %>% filter(cluster == '3') %>% column_to_rownames() %>% top_n(n = 50, wt = avg_log2FC)))
## [1] FALSE
bcn_ms_unscaled_lvl4_markers[clus3_markers, ][order(bcn_ms_unscaled_lvl4_markers[clus3_markers, ]$avg_log2FC, decreasing = TRUE), ]
bcn_ms_scaled_lvl4_markers[clus3_markers, ][order(bcn_ms_scaled_lvl4_markers[clus3_markers, ]$avg_log2FC, decreasing = TRUE), ]
Are these genes important? How is this new subgroup functionally distinct from proper neuroblasts and connecting chromaffins? Is this group specific to a particular timepoint? Note that stages CS18-CS23 correspond to approximately 8-10 post-conception weeks.
DimPlot(bcn_ms, reduction = 'msphate_unscaled', group.by = 'timepoint2', pt.size = 1) + ggtitle('bcn_ms msphate_unscaled by timepoint')
DimPlot(bcn_ms, reduction = 'msphate_scaled', group.by = 'timepoint2', pt.size = 1) + ggtitle('bcn_ms msphate_unscaled by timepoint')
This new group is not distinct at any one timepoint (unlike neuroblasts, which are clearly only present at CS23).
To finalise the analysis of this group, we can visualise the expression of these marker genes on the embedding. After testing a few options, only LINC00669 is clearly expressed differently in cluster 3.
FeaturePlot(bcn_ms, reduction = 'msphate_scaled', features = 'LINC00669', cols = c('grey', 'red'), pt.size = 1, combine=TRUE) + coord_cartesian(xlim = c(-0.02, 0.02), ylim = c(-0.012, 0.016)) + ggtitle('msphate_scaled by clus3_marker')
FeaturePlot(bcn_ms, reduction = 'msphate_unscaled', features = 'LINC00669', cols = c('grey', 'red'), pt.size = 1, combine=TRUE) + coord_cartesian(xlim = c(-0.02, 0.02), ylim = c(-0.012, 0.016)) + ggtitle('msphate_scaled by clus3_marker')
Far from conclusive, but this is the kind of process I would use to identify interesting genes in an scRNA-seq dataset.
Towards the end of this project, I received a combined dataset of Jansky et al. and Kameneva et al.
Let’s import it!
And plot it! The legend is a little messy because the two studies referred tp the same cell types by different names (sympathoblasts = neuroblasts).
DimPlot(comb, reduction = 'umap_scaled', group.by = 'dataset', pt.size = 1.5) + ggtitle('comb umap by dataset')
DimPlot(comb, reduction = 'umap_scaled', group.by = 'fate', pt.size = 1.5) + ggtitle('comb umap by fate')
The connection between chromaffins and neuroblasts disappeared in this embedding, which is problematic because it is the trajectory we are interested in. However, it may reappear with later zooming in or when using a different dimensionality reduction method.
Such as msPHATE!
DimPlot(comb, reduction = 'msphate_scaled', group.by = 'msphate_clusters_lvl4_scaled', pt.size = 1.5) + ggtitle('comb msphate_scaled at lvl4')
DimPlot(comb, reduction = 'msphate_unscaled', group.by = 'msphate_clusters_lvl4_unscaled', pt.size = 1.5) + ggtitle('comb msphate_unscaled at lvl4')
DimPlot(comb, reduction = 'msphate_unscaled', group.by = 'msphate_clusters_lvl4_scaled', pt.size = 1.5) + ggtitle('comb msphate_unscaled by msphate scaled lvl 4 clusters')
What’s going on here? Only when using scaled combined data with msPHATE did a strange appendix appear. Is it an issue that arose when merging the datasets? A technical issue with the Kameneva et al. data? An artefact of msPHATE? This last view is supported by the three odd clusters centering on a little appendix in the centre of the unscaled msPHATE embedding. Then again, the growth of this “appendix” may be the result of scaling alone.
Many thanks to Dr Katherine Pillman for directing my work and teaching me all I know about bioinformatics Also thanks to Dr Nick Warnock and David Lawrence (also Centre for Cancer Biology at UniSA) for insightful discussions, code suggestions, etc